import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler
import plotly.express as px
%load_ext autoreload
%autoreload 2
%matplotlib inline
%run ./support_functions.ipynb
%run ./project_algos.ipynb
pd.options.display.max_colwidth = 500
Load the data.
data = pd.read_csv("data_v4.csv")
Create TF-IDF vectors from the movie descriptions.
vectorizer = TfidfVectorizer(stop_words="english")
tfvectors = vectorizer.fit_transform(data.descriptions)
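As a small illustration of what the vectorizer produces, the sketch below runs the same `TfidfVectorizer` settings on a hypothetical three-sentence corpus (the sentences and the `toy_` names are made up for this example and are not part of `data_v4.csv`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for data.descriptions.
toy_descriptions = [
    "A bear is stranded in the woods before hunting season.",
    "Cardinals meet in Rome to elect a new pope.",
    "Two brothers run a bait shop in the bayou.",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectors = toy_vectorizer.fit_transform(toy_descriptions)

# One row per description, one column per vocabulary term (stop words removed).
print(toy_vectors.shape)
# The first few vocabulary terms in alphabetical order.
print(sorted(toy_vectorizer.vocabulary_)[:5])
```

The result is a SciPy sparse matrix, and with the default settings each row is L2-normalized.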
Convert the sparse matrix to a dense array
vectors = tfvectors.toarray()  # .A is deprecated in recent SciPy releases
vectors.shape
(40283, 44740)
Standardize for PCA
tfX = StandardScaler().fit_transform(vectors)
Standard PCA is not feasible on this system due to RAM limitations. Instead, use Incremental PCA to reduce the vectors to three components.
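A rough back-of-the-envelope calculation suggests why memory is the bottleneck. The sketch below only uses the shape reported above and assumes 8-byte floats, which is the `TfidfVectorizer` default dtype.

```python
import numpy as np

n_rows, n_cols = 40283, 44740  # shape reported for the dense TF-IDF matrix
bytes_per_value = np.dtype(np.float64).itemsize  # assumes the default float64 dtype

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"one dense copy: {dense_bytes / 1e9:.1f} GB")
# StandardScaler returns a second array of the same size by default.
print(f"dense + standardized copy: {2 * dense_bytes / 1e9:.1f} GB")
```

A single dense copy already needs about 14.4 GB, before PCA allocates anything of its own.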
n_components = 3
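batch_size = 20  # number of rows processed per incremental step
ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
# Fitting is commented out; a previously saved projection is loaded instead.
# Xpca = ipca.fit_transform(tfX)
Xpca = np.load("tfidf_Xpca_n3.npy")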
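One possible way to keep memory use low is to feed `IncrementalPCA` one slice at a time with `partial_fit`, densifying only the current slice. The sketch below is an assumption about how that could look, not the method used to produce `tfidf_Xpca_n3.npy`: it runs on a hypothetical random sparse matrix and skips the `StandardScaler` step, so its output would differ from the cached array.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import IncrementalPCA

# Hypothetical stand-in for the TF-IDF matrix: 200 documents, 50 terms.
toy_sparse = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

toy_ipca = IncrementalPCA(n_components=3)
batch_rows = 40  # rows densified per step; must be at least n_components

# First pass: update the components one dense slice at a time.
for start in range(0, toy_sparse.shape[0], batch_rows):
    toy_ipca.partial_fit(toy_sparse[start:start + batch_rows].toarray())

# Second pass: project each slice and stack the results.
toy_Xpca = np.vstack([
    toy_ipca.transform(toy_sparse[start:start + batch_rows].toarray())
    for start in range(0, toy_sparse.shape[0], batch_rows)
])
print(toy_Xpca.shape)
```

Only `batch_rows` rows are dense at any moment, and the stacked result could be written with `np.save` so later runs can load it instead of refitting.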
targets = data.loc[:, "titles"].values
# Build the frame directly from the (n_samples, 3) array, then prepend the titles.
df = pd.DataFrame(Xpca, columns=["x", "y", "z"])
df.insert(0, "title", targets)
df.head()
| | title | x | y | z |
|---|---|---|---|---|
| 0 | Mash | -0.026043 | -0.025258 | -0.021983 |
| 1 | Little Big Man | -0.021741 | -0.029097 | -0.023552 |
| 2 | Love Story | -0.022841 | -0.025666 | -0.019878 |
| 3 | Two Mules For Sister Sara | -0.031287 | -0.035015 | -0.028571 |
| 4 | The Aristocats | -0.031208 | -0.031382 | -0.026906 |
Visualize the three components in an interactive 3D scatter plot.
fig = px.scatter_3d(df, x="x", y="y", z="z", hover_name="title", title="Interactive Plot of Movies by PCA Component")
fig.show()
Eliminate severe outliers.
outliers = ["Az Elvarázsolt Dollár", "Männerpension", "Olsenbanden Gir Seg Aldri!"]
idx = []
for name in outliers:
    idx.append(df.loc[df.title == name].index[0])
idx
[5369, 9950, 3622]
df2 = df.drop(idx)
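The outliers above were picked by name. As an alternative, points far from the bulk of the data could be flagged automatically. The sketch below is one hypothetical approach on made-up coordinates: it measures each point's distance from the per-axis median and cuts at a multiple of the median absolute deviation. The `toy_` names and the factor of 10 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates: a tight cluster plus one far-away point.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "Far Away"],
    "x": [0.01, 0.02, 0.00, -0.01, 0.01, 5.0],
    "y": [0.00, 0.01, -0.01, 0.02, 0.00, 4.0],
    "z": [0.01, 0.00, 0.02, -0.02, 0.01, 6.0],
})

coords = toy_df[["x", "y", "z"]]
# Euclidean distance of each point from the per-axis median.
dist = np.sqrt(((coords - coords.median()) ** 2).sum(axis=1))

# Robust cutoff: median distance plus a multiple of the median absolute deviation.
mad = (dist - dist.median()).abs().median()
threshold = dist.median() + 10 * mad  # the factor 10 is an arbitrary choice
outlier_idx = toy_df.index[dist > threshold].tolist()

toy_df2 = toy_df.drop(outlier_idx)
print(toy_df.loc[outlier_idx, "title"].tolist())
```

On the real data the threshold would need tuning against the plot, since a cutoff that is too tight would also remove legitimate points.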
Plot again without the outliers.
df2.columns = ["title","PCA1", "PCA2", "PCA3"]
fig = px.scatter_3d(df2, x="PCA1", y="PCA2", z="PCA3", hover_name="title", title="Interactive Plot of Movies by PCA Component (Outliers Removed)")
fig.show()
fig = px.scatter(df2, x="PCA1", y="PCA2", hover_name="title", title="Interactive Plot of Movies by First Two PCA Components")
fig.show()
Select three movies that sit close together in the plot and inspect them.
names = ["Open Season", "The Conclave", "Little Chenier"]
data.loc[data.titles==names[0]]
| | uids | titles | genres | ratings | scores | votes | lengths | directors | stars | descriptions | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1378 | tt0071292 | Open Season | Action, Drama, Thriller | R | 6.1 | 530 | 104 | Peter Collinson | Peter Fonda, Cornelia Sharpe, John Phillip Law, Richard Lynch | Three young men take a young woman and a middleaged man to an isolated cabin, where they are terrorized in different ways. | 1974 |
| 17370 | tt0400717 | Open Season | Animation, Adventure, Comedy | PG | 6.1 | 87017 | 86 | Roger Allers | Ashton Kutcher, Martin Lawrence, Debra Messing, Gary Sinise | Boog, a domesticated 900lb. Grizzly bear, finds himself stranded in the woods 3 days before Open Season. Forced to rely on Elliot, a fast-talking mule deer, the two form an unlikely friendship and must quickly rally other forest animals if they are to form a rag-tag army against the hunters. | 2006 |
data.loc[data.titles==names[1]]
| | uids | titles | genres | ratings | scores | votes | lengths | directors | stars | descriptions | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17833 | tt0452011 | The Conclave | Biography, Drama, History | None | 6.5 | 545 | 100 | Christoph Schrewe | Manu Fullola, Brian Blessed, James Faulkner, Rolf Kanies | In 1458, five years after the fall of Constantinople to the Turk, eighteen cardinals met in Rome to elect a new pope. A 27-year-old Spanish cardinal, Rodrigo Borgia, learns to play a very dangerous game; how to survive his first conclave. | 2006 |
data.loc[data.titles==names[2]]
| | uids | titles | genres | ratings | scores | votes | lengths | directors | stars | descriptions | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18109 | tt0758764 | Little Chenier | Drama | R | 7.0 | 555 | 120 | Bethany Ashton Wolf | Johnathon Schaech, Frederick Koehler, Tamara Braun, Jeremy Davidson | Deep in the bayou sits a floating town called Little Chenier. It is here that Beaux and his mentally challenged brother, Pemon, run a bait-and-tackle shop. Pemon is accused of a crime, and Beaux chooses to protect his brother at all costs. | 2006 |
For future work, consider using genres as a clustering factor. At an early stage, the combination of genres assigned to each movie may allow for more targeted recommendations.
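A minimal sketch of how the genre combinations might be turned into features, assuming the `genres` column keeps the comma-separated format shown in the tables above. The `toy_genres` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows mirroring the comma-separated format of the genres column.
toy_genres = pd.DataFrame({
    "titles": ["Open Season", "The Conclave", "Little Chenier"],
    "genres": ["Animation, Adventure, Comedy", "Biography, Drama, History", "Drama"],
})

# One indicator column per genre; a movie with several genres gets several 1s.
genre_matrix = toy_genres["genres"].str.get_dummies(sep=", ")
print(genre_matrix.columns.tolist())
print(genre_matrix)
```

The resulting indicator matrix could be fed to a clustering algorithm on its own or concatenated with the PCA coordinates.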